Subscription services are leveraged by companies across many industries, from fitness to video streaming to retail. One of the primary objectives of companies with subscription services is to decrease churn and ensure that users are retained as subscribers. In order to do this efficiently and systematically, many companies employ machine learning to predict which users are at the highest risk of churn, so that proper interventions can be effectively deployed to the right audience.
In this project, I will tackle the churn prediction problem for subscribers of a video streaming service.
The data is available here.
The data is already split into train and test sets.
train.csv contains 70% of the overall sample (243,787 subscriptions, to be exact) and, importantly, reveals whether or not each subscription was continued into the next month (the "ground truth").
test.csv contains the same information about the remaining 30% of the sample (104,480 subscriptions, to be exact), except for the actual churn outcome. The goal of this project is to predict churn for each of these customers.
import pandas as pd
data_descriptions = pd.read_csv('data_descriptions.csv')
pd.set_option('display.max_colwidth', None)
data_descriptions
! pip install plotly
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from scipy.stats import randint
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline
train_df = pd.read_csv("train.csv")
print('train_df Shape:', train_df.shape)
train_df.head()
test_df = pd.read_csv("test.csv")
print('test_df Shape:', test_df.shape)
test_df.head()
train_df.isna().any()
train_df.describe()
train_df.dtypes
Check the classes for each categorical variable.
train_df["SubscriptionType"].unique()
train_df["PaymentMethod"].unique()
train_df["PaperlessBilling"].unique()
train_df["ContentType"].unique()
train_df["MultiDeviceAccess"].unique()
train_df["DeviceRegistered"].unique()
train_df["GenrePreference"].unique()
train_df["Gender"].unique()
train_df["ParentalControl"].unique()
train_df["SubtitlesEnabled"].unique()
Now, dummy code the categorical variables except CustomerID.
# Drop CustomerID (an identifier, not a predictive feature); .drop returns a new frame
train_df_new = train_df.drop("CustomerID", axis=1)
categorical_columns = ['SubscriptionType', 'PaymentMethod', 'PaperlessBilling', 'ContentType', 'MultiDeviceAccess',
'DeviceRegistered', 'GenrePreference', 'Gender', 'ParentalControl', 'SubtitlesEnabled']
for col in categorical_columns:
    dummies = pd.get_dummies(train_df_new[col], prefix=col)
    train_df_new = pd.concat([train_df_new, dummies], axis=1)
train_df_new = train_df_new.drop(categorical_columns, axis=1)
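As a side note, the per-column loop above can also be written as a single `pd.get_dummies` call, which dummy-codes the listed columns and drops the originals in one step. A minimal sketch on a toy frame (the column values here are hypothetical stand-ins for the real data):

```python
import pandas as pd

# Toy frame standing in for train_df (hypothetical values)
df = pd.DataFrame({
    "SubscriptionType": ["Basic", "Premium", "Basic"],
    "MonthlyCharges": [9.99, 15.99, 9.99],
})

# One call creates the prefixed dummy columns and removes the source column
df_new = pd.get_dummies(df, columns=["SubscriptionType"])
print(df_new.columns.tolist())
```

The result has `SubscriptionType_Basic` and `SubscriptionType_Premium` columns alongside the untouched numeric column, so the separate `drop` step is no longer needed.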
Check the proportion of churn in the training set.
fig = px.histogram(train_df['Churn'].astype(str), color=train_df['Churn'].astype(str), title="Churn count")
# Calculate percentages
total_count = len(train_df)
percentage_labels = [(f'{churn}: {count / total_count:.2%}') for churn, count in train_df['Churn'].value_counts().items()]
# Add percentage as a caption
fig.update_layout(
    annotations=[
        dict(
            text=', '.join(percentage_labels),
            showarrow=False,
            x=0.5,
            y=1.1,
            font=dict(size=12),
        )
    ]
)
# Show the plot
fig.show()
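With classes this imbalanced, raw accuracy can look deceptively good: a model that always predicts "no churn" scores close to the majority-class share. A majority-class baseline via `DummyClassifier` (already imported above) makes this concrete; the sketch below uses synthetic data with a roughly 18% positive rate standing in for the real training set:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for (X_train, y_train): ~18% positive (churn) class
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.18).astype(int)

# Always predicts the majority class ("no churn")
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))
print(f"Baseline accuracy: {baseline_acc:.2%}")
```

Any real model should clear this baseline; otherwise it has learned nothing beyond the class proportions.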
In this project, I decided to use a random forest classifier because it handles mixed numeric and dummy-coded features well, requires little preprocessing (no feature scaling), and provides feature importances that help with interpretation.
X_train = train_df_new.drop("Churn", axis = 1)
y_train = train_df_new["Churn"]
model = RandomForestClassifier()
model.fit(X_train, y_train)
feature_importances = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
fig = px.bar(feature_importances)
fig.show()
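One caveat worth noting: impurity-based `feature_importances_` can be biased toward high-cardinality or continuous features. Permutation importance on a held-out split is a common cross-check. A minimal sketch on synthetic data where only the first feature is informative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 determines the label
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the validation set and measure the score drop
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
print(perm.importances_mean)
```

In the notebook, the same call with the fitted model and a validation split of the real data would provide a second opinion on the ranking above.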
There are clear differences between the important and unimportant variables. For the ones with importance > 0.04, check whether any of them are correlated with one another.
plt.figure(figsize=(16, 6))
important_variables = feature_importances[0:9].index  # top 9 features (importance > 0.04)
sns.heatmap(train_df[important_variables].corr())
The heatmap shows correlations between "AccountAge" and "TotalCharges", and between "MonthlyCharges" and "TotalCharges", which makes sense: total charges accumulate with account age and scale with the monthly fee. Remove all the unimportant variables as well as "TotalCharges" from the model and run a new model.
important_variables = important_variables.drop("TotalCharges")
X_train = train_df_new[important_variables]
y_train = train_df_new["Churn"]
model2 = RandomForestClassifier()
model2.fit(X_train, y_train)
feature_importances = pd.Series(model2.feature_importances_, index=X_train.columns).sort_values(ascending=False)
fig = px.bar(feature_importances)
fig.show()
Now that we have identified the variables to include, we will tune the model by searching for the best hyperparameters.
param_dist = {'n_estimators': randint(50, 500),
              'max_depth': randint(1, 20)}
rf = RandomForestClassifier()
rand_search = RandomizedSearchCV(rf,
                                 param_distributions=param_dist,
                                 n_iter=5,
                                 cv=5)
rand_search.fit(X_train, y_train)
best_rf = rand_search.best_estimator_
print('Best hyperparameters:', rand_search.best_params_)
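Since the test set has no labels, cross-validated ROC AUC on the training data is one way to estimate how well the tuned model would perform (`roc_auc_score` was imported above but AUC is easiest to get via `cross_val_score`). A sketch on synthetic data, with hypothetical best parameters; in the notebook, `X_train` and `y_train` would be passed instead:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, y_train)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Cross-validated ROC AUC for a tuned forest (hypothetical best params)
best_rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)
auc_scores = cross_val_score(best_rf, X, y, cv=5, scoring="roc_auc")
print(f"Mean CV ROC AUC: {auc_scores.mean():.3f}")
```

ROC AUC is a good fit here because it is insensitive to the class imbalance seen in the churn counts earlier.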
Using the model with the best parameters, we will finally make predictions on the test data.
X_test = test_df[important_variables]
y_pred = best_rf.predict(X_test)
prediction_df = pd.DataFrame({'CustomerID': test_df['CustomerID'], 'preds': y_pred})
fig = px.histogram(prediction_df.preds)
fig.show()
prediction_df.preds.value_counts()
result = test_df.copy()  # copy so test_df itself is not modified in place
result['preds'] = prediction_df['preds']
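For targeting interventions, hard 0/1 predictions are often less useful than churn probabilities, which let the business rank customers by risk. `predict_proba` provides this; the sketch below uses a synthetic model and features as stand-ins for `best_rf` and `X_test`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the fitted model and test features
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)

X_test = rng.normal(size=(50, 3))
# Probability of the positive (churn) class, for ranking at-risk users
churn_proba = model.predict_proba(X_test)[:, 1]
ranked = pd.DataFrame({"CustomerID": range(50), "churn_proba": churn_proba}).sort_values(
    "churn_proba", ascending=False)
print(ranked.head())
```

The top of this ranking would be the natural audience for retention offers.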
Check how the predicted churn groups differ on the variables with high importance.
fig = px.box(x = result.preds, y = result.AverageViewingDuration)
fig.show()
You can see that the predicted churners spend a shorter amount of time watching a show or movie. Possible explanations are a lack of engaging content or waning interest in the service.
fig = px.box(x = result.preds, y = result.ViewingHoursPerWeek)
fig.show()
Similarly, the churned users spent significantly less time using the streaming service.
fig = px.box(x = result.preds, y = result.MonthlyCharges)
fig.show()
The churn group pays a higher subscription fee, which is a reasonable incentive to cancel the plan.
fig = px.box(x = result.preds, y = result.AccountAge)
fig.show()
The users in the churned group have not been on the platform for as long.
fig = px.box(x = result.preds, y = result.UserRating)
fig.show()
User ratings are similar regardless of churn, which suggests that the quality of the content is not the main reason for cancellation.
fig = px.box(x = result.preds, y = result.ContentDownloadsPerMonth)
fig.show()
The users in the churned group do not take advantage of content downloading, or might not be aware of the feature, which may contribute to their dissatisfaction.
fig = px.box(x = result.preds, y = result.WatchlistSize)
fig.show()
The size of the watchlist is not noticeably different between the groups.
fig = px.box(x = result.preds, y = result.SupportTicketsPerMonth)
fig.show()
The number of support tickets is larger for the churned group. We can assume that they had trouble with the service or concerns about their plan.
There are many factors that can lead to customer churn. The boxplots above illustrate how some of the variables differ between the predicted churn groups.
In this project, a Random Forest Classifier was used to predict customer churn on a subscription service. First, exploratory data analysis showed potential relationships between the variables and churn. Then, the model was developed and fine-tuned. The model used average viewing duration, viewing hours per week, monthly charges, account age, user rating, content downloads per month, watchlist size, and support tickets per month as the variables with high importance. Since the test data does not contain the ground-truth labels, unfortunately, we were unable to evaluate the prediction accuracy and model performance.
As a next step, we would like to test different models, such as logistic regression, k-nearest neighbors, decision tree classifier, and support vector machine, and compare their performance.
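Such a comparison can use the same cross-validation splits and metric for every candidate so the scores are directly comparable. A minimal sketch on synthetic stand-in data (in the notebook, `X_train` and `y_train` would be used):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for (X_train, y_train)
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Same CV folds and metric for every candidate, for a fair comparison
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {auc:.3f}")
```

A support vector machine (`sklearn.svm.SVC` with `probability=True` for AUC) could be added to the dictionary the same way, though it is slower on a dataset of this size.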